Moving to a World beyond p < .05

An introduction to the philosophy of statistics

Ladislas Nalborczyk

LPC, LNC, CNRS, Aix-Marseille Univ

Slides available at https://github.com/lnalborczyk/intro_phil_stat_2022

Program 👋 👋

  • Introduction to the philosophy of statistics: Theories, models, evidence, inference
    • Theoretical and statistical models
    • Statistical evidence and inference
  • Correct and incorrect interpretations of common hypothesis tests
    • P-values and confidence intervals
    • Bayes factors
    • Problems induced by the mindless use of statistics
  • How to move forward: A model comparison (and model criticism) approach
    • Statistical modelling and model comparison
    • A principled Bayesian workflow

Introduction to the philosophy of statistics (why we need statistics in the first place)

Scientific theories

A scientific theory can be defined as a set of logical propositions that posits causal relationships between observable phenomena.

  • Initially broad and abstract: “Every object responds to the force of gravity in the same way”.
  • Then, concrete (testable) predictions: “The falling speed of two objects A and B should be the same, all other things being equal”.

These logical propositions are originally abstract and broad (e.g., “every object responds to the force of gravity in the same way”) but lead to concrete and specific predictions that are empirically testable (e.g., “the falling speed of two objects A and B should be the same, all other things being equal”).

Scientific theories

The concept of a scientific theory is not a unitary concept though. As an example, Meehl (1986) lists three kinds of theories:

  • Functional-dynamic theories which relate “states to states or events to events”. For instance, we say that when one variable changes, certain other variables change in such and such ways.
  • Structural-compositional theories in which the main idea is to explain what something is composed of, or what kind of parts it has, and how they are put together.
  • Evolutionary theories which are about the history and/or development of things (e.g., Darwin’s theory, Wegener’s theory of continental drift, the fall of Rome, etc).

First problem: We can not confirm theories

According to Campbell (1990), the (intuitive) logical argument of science has the following form:

  • If Newton’s theory A is true, then it should be observed that the tides have period B, the path of Mars shape C, the trajectory of a cannonball form D, etc.
  • Observation confirms B, C, and D.
  • Therefore Newton’s theory A is “true”.

However, this argument is a fallacious argument known as the affirmation of the consequent. The invalidity comes from the existence of the cross-hatched area, that is, other possible explanations for B, C, and D being observed (figure from Campbell, 1990).

Second problem: We can not (strictly) falsify theories

We can not confirm theories, but maybe we can at least think of a way of disproving them? According to Popper’s view, a theory can be considered as falsifiable if it can be shown to be false. But what does it mean for a theory to be false?

Here we should note that the falsifiability of early Popper concerns the problem of demarcation (i.e., what is science and what is pseudoscience), and defines pseudosciences as composed of non falsifiable theories (i.e., theories that do not allow the possibility of being disproved).

But when it comes to describe how science works (descriptive purposes) or to know how scientific enquiries should be lead (prescriptive purposes), science is usually not described by the falsification standard, as Popper himself recognised and argued. In fact, deductive falsification is impossible in nearly every scientific context (McElreath, 2016).

In the next sections, we discuss some of the reasons that prevent (almost) any scientific theory to be strictly falsified (in a logical sense), namely: i) the distinction between theoretical and statistical models ii) the problem of measurement iii) the problem of continuous hypotheses, and iv) the Duhem-Quine problem.

1) Theoretical and statistical models

A statistical model is a device that connect theories to data. It can be defined as an instantiation of a theory as a set of probabilistic statements (Rouder et al., 2016).

Theoretical models and statistical models are usually not equivalent as many different theoretical models can correspond to the same probabilistic description. Conversely, different probabilistic descriptions can be derived from the same theoretical model. In other words, there is no one-to-one mapping between the two worlds, which render the induction from the statistical model to the theoretical model quite tricky (figure from McElreath, 2020).

1) Theoretical and statistical inference

Causal and inferential relations between substantive theory, statistical hypothesis, and observational data (figure from Meehl, 1990).

Another problem yet, as stressed by Paul Meehl, is that while statistical methodology usually deals with the issue of assessing the validity of statistical hypotheses from observations, it does not address, and maybe can not address, the issue of assessing the validity of substantive theories from the corroboration or disconfirmation of statistical hypotheses.

2) Measurement error

The logic of falsification is pretty simple and rests on the power of the modus tollens. This argument (whose exposition, for some reason, usually involves swans) can be presented as follows:

  • If my theory \(T\) is right, then I should observe these data \(D\).
  • I observe data that are not those I predicted \(\neg D\).
  • Therefore, my theory is wrong \(\neg T\).

This argument is perfectly valid and works well for logical statements (statements that are either true or false). However, the first problem that arises when we try to apply this reasoning to the “real world” is the problem of observation error: observations are prone to error, especially at the boundaries of knowledge (McElreath, 2016).

2) Measurement error

According to Einstein, neutrinos can not travel faster than the speed of light. Thus, any observation of faster-than-light neutrinos would act as a strong falsifier of Einstein’s special relativity. In 2011 however, a large team of respected physicists announced the detection of faster-than-light neutrinos (cf. the Wikipedia article).

What was the reaction of the scientific community? The dominant reaction was not to claim Einstein’s theory to be falsified but was instead: “How did this team mess up the measurement?” (McElreath, 2016). And they were right to suspect something was wrong with the measurement: A fiber optic cable was attached improperly, and a clock oscillator was ticking too fast…

3) Probabilistic hypotheses

Another problem arises from a misapplication of deductive syllogistic reasoning (a misapplication of the modus tollens). The problem (the “permanent illusion,” as put by Gigerenzer, 1993) is that most scientific hypotheses are not really of the kind “all swans are white” but rather of the form:

  • Ninety percent of swans are white.
  • If my hypothesis is correct, we should probably not observe a black swan.

Given this hypothesis, what can we conclude if we observe a black swan? Not much. To understand why, let’s translate it first to a more common statement in psychological research (from Cohen, 1994):

  • If the null hypothesis is true, then these data are highly unlikely.
  • These data have occurred.
  • Therefore, the null hypothesis is highly unlikely.

But because of the probabilistic premise (i.e., the “highly unlikely”) this conclusion is invalid. Why?

3) Probabilistic hypotheses

Consider the following argument (Cohen, 1994; Pollard & Richardson, 1987):

  • If a person is an American, he is probably not a member of Congress.
  • This person is a member of Congress.
  • Therefore, he is probably not an American.

This conclusion is not sensible (the argument is invalid), because it fails to consider the alternative to the premise, which is that if this person were not an American, the probability of being a member of Congress would be 0.

This is formally exactly the same as:

  • If the null hypothesis is true, then these data are highly unlikely.
  • These data have occurred.
  • Therefore, the null hypothesis is highly unlikely.

Which is as much invalid as the previous argument, because i) the premise (the hypothesis) is probabilistic/continuous rather than discrete/logical and ii) because it fails to consider the probability of the alternative. Thus, even without measurement/observation error, this problem would prevent us from applying the modus tollens to our hypothesis, thus preventing any possibility of strict falsification.

4) The underdetermination problem

Again another problem is known as the Duhem–Quine thesis/problem (aka the underdetermination problem). In practice, when a substantive theory \(T\) happens to be tested, some hidden assumptions, such as auxiliary theories about the instruments we use, are also put under examination (Meehl, 1990, 1978, 1997).

When we test a theory predicting that “if \(O_{1}\)” (some manipulation), “then \(O_{2}\)” (some observation), what we actually mean is that we should observe this relation, if and only if all of the above (i.e., the auxiliary theories, the instrument theories, the particulars, etc) are true.

4) The underdetermination problem

Thus, the logical structure of an empirical test of a theory \(T\) can be described as the following conceptual formula (Meehl, 1990, 1978, 1997):

\[(T \land A_{t} \land C_{p} \land A_{i} \land C_{n}) \to (O_{1} \supset O_{2})\]

where the \(\land\) are conjunctions (“and”), the arrow \(\to\) denotes deduction (“follows that …”), and the horseshoe \(\supset\) is the material conditional (“If \(O_{1}\), Then \(O_{2}\)”). \(A_{t}\) is a conjunction of auxiliary theories, \(C_{p}\) is a “ceteribus paribus” clause (i.e., we assume there is no other factor exerting an appreciable influence that could obfuscate the main effect of interest), \(A_{i}\) is an auxiliary theory regarding instruments, and \(C_{n}\) is a statement about experimentally realised conditions (i.e., we assume that there is no systematic error/noise in the experimental settings).

4) The underdetermination problem

\[(T \land A_{t} \land C_{p} \land A_{i} \land C_{n}) \to (O_{1} \supset O_{2})\]

However, although the modus tollens is a valid figure of the implicative syllogism for logical statements (e.g., “all swans are black”), the neatness of Popper’s classic falsifiability concept is fuzzed up by the acknowledgement of the actual form of an empirical test. Obtaining falsificative evidence during an empirical test does not only falsify the substantive theory \(T\), but it does falsify all the left-side of the above statement. In other words, what we have achieved by our laboratory or correlational “falsification” is a falsification of the combined claims \(T \land A_{t} \land C_{p} \land A_{i} \land C_{n}\), which is probably not what we had in mind when we did the experiment (Meehl, 1990).

To sum up, failing to observe a predicted outcome does not necessarily mean that the theory itself is wrong, but rather that the conjunction of the theory and the underlying assumptions at hand are invalid (Lakatos, 1976; Meehl, 1978, 1997).

Consequences

Falsification in science is almost always consensual, not logical (McElreath, 2020). A theoretical claim is considered to be falsified only when multiple lines of converging evidence have been obtained, by independent teams of researchers, and usually after several years or decades of critical discussion. The “falsification of a theory” then appears as a social result, issued from the community of scientists, and (almost) never as a deductive falsification.

How can we accumulate evidence in favour of or against a theory? That’s were statistics comes into play. There are several philosophical frameworks for statistical inference, which differ by their assumptions and by their definition of what counts as evidence in favour or against a theory.

Correct and incorrect interpretations of common hypothesis tests: p-values and confidence intervals

Null Hypothesis Significance Testing (NHST)

Let’s say we are interested in height differences between women and men…

set.seed(19) # to get reproducible results
men <- rnorm(n = 100, mean = 175, sd = 10) # 100 men heights
women <- rnorm(n = 100, mean = 170, sd = 10) # 100 women heights
t.test(x = men, y = women)

    Welch Two Sample t-test

data:  men and women
t = 2.4969, df = 197.98, p-value = 0.01335
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.7658541 6.5209105
sample estimates:
mean of x mean of y 
 175.1258  171.4825 

None of these definitions is true… 🥺🥺

Null Hypothesis Significance Testing (NHST)

We are going to simulate t-values computed on samples generated under the assumption of no difference between women and men (the null hypothesis \(\mathcal{H}_{0}\)).

nsims <- 1e4 # number of simulations
t <- rep(x = NA, times = nsims) # initialising an empty vector

for (i in 1:nsims) {
    
    men2 <- rnorm(n = 100, mean = 170, sd = 10)
    women2 <- rnorm(n = 100, mean = 170, sd = 10)
    t[i] <- t.test(x = men2, y = women2)$statistic
    
}

Or without for loops.

t <- replicate(n = nsims, expr = t.test(x = rnorm(100, 170, 10), y = rnorm(100, 170, 10) )$statistic)

Null Hypothesis Significance Testing (NHST)

data.frame(t = t) %>%
    ggplot(aes(x = t) ) +
    geom_histogram()

Null Hypothesis Significance Testing (NHST)

data.frame(t = c(-5, 5) ) %>%
    ggplot(aes(x = t) ) +
    stat_function(fun = dt, args = list(df = t.test(men, women)$parameter), size = 1.5) +
    ylab("Probability density")

Null Hypothesis Significance Testing (NHST)

alpha <- .05 # significance threshold (alpha)
abs(qt(alpha / 2, df = t.test(x = men, y = women)$parameter) ) # two-sided critical t-value
[1] 1.972019

Null Hypothesis Significance Testing (NHST)

tobs <- t.test(x = men, y = women)$statistic # observed t-value
tobs %>% as.numeric
[1] 2.496871

P-values

A p-value is simply a tail area (an integral) computed from the distribution of test statistics under (given) the null hypothesis. It gives the probability of observing the data we observed or more extreme data, given that the null hypothesis is true (Wagenmakers, 2007).

\[p[\mathbf{t}(\mathbf{x}^{\text{rep}} ; \mathcal{H}_{0}) \geq t(x)]\]

t.test(x = men, y = women)$p.value
[1] 0.01334509
tvalue <- abs(t.test(x = men, y = women)$statistic)
df <- t.test(x = men, y = women)$parameter
2 * integrate(f = dt, lower = tvalue, upper = Inf, df = df)$value
[1] 0.01334509

Fisher versus Neyman & Pearson

According to Fisher, the p-value is thought to measure the strength of evidence against the null hypothesis: the lower the p-value, the stronger the evidence against the null hypothesis. But we know that p-values at best correlate (in a loose meaning) with evidence (e.g., see Wagenmakers, 2007).

The Fisherian continuous interpretation of p-values has many problems (cf. next slide) and has been widely criticised.

Neyman & Pearson used p-values and significance thresholds as a way of controlling error rates in the long run. In this perspective, we don’t interpret the p-value, we only “classify” results as significant or non-significant. This strict procedure allows keeping error rates at a fixed level (given that the null hypothesis is true, see this blogpost). However, this view also has serious problems. One of the biggest problem being the domain problem (e.g., Trafimow & Earp, 2017).

Logic, frequentism, and probabilistic reasoning

The modus tollens is one of the strongest rule of inference in logic. It works perfectly well in science when we deal with hypotheses of the following form: “If \(\mathcal{H}_{0}\) is true, then we should not observe \(x\). We observed \(x\). Then, \(\mathcal{H}_{0}\) is false”.

BUT, most of the time, we deal with continuous, probabilistic hypotheses…

The Fisherian inference (induction) is of the form: “If \(\mathcal{H}_{0}\) is true, then we should PROBABLY not observe \(x\). We observed \(x\). Then, \(\mathcal{H}_{0}\) is PROBABLY false”.

However, as we have seen previously, this argument is invalid. The modus tollens does not apply to probabilistic statements (e.g., Pollard & Richardson, 1987).

Interpreting confidence intervals

Confidence intervals are basically regions of significance. Thus, they have to be interpreted as cautiously as p-values, and are submitted to the same flaws.

A 95% confidence interval does not mean that there is a 95% probability that the interval contains the population value of the parameter (remember the modus tollens fallacy).

The only correct interpretation is to think about it in terms of coverage proportion (see next slide and this blogpost).

A 95% confidence interval represents a statement about the procedure, not about the parameter. It means that, in the long run, 95% of the confidence intervals we could compute (in an exact replication of the experiment) would contain the population value of the parameter. But we can not say anything about the particular confidence interval we computed in this particular experiment…

Preliminary summary (statistics is hard)

Frequentist statistics (e.g., p-values and confidence intervals) make sense under the frequentist interpretation of probability: they refer to long-run frequencies.

P-values are simply tail areas in probability distributions. It means that they are conditional on some distribution. But it also means that computing a p-value is a generic statistical procedure, it’s not inextricable from the null hypothesis (e.g., see Bayesian p-values).

Confidence intervals are basically regions of significance. Thus, they are prone to the very same limits as p-values.

Correct and incorrect interpretations of common hypothesis tests: Bayes factors

Bayes factors

Instead of testing only one hypothesis (the null hypothesis), Bayes factors allow comparing two hypotheses. For instance, let’s say we are comparing two models:

  • \(\mathcal{H}_{0}: \mu_{1} = \mu_{2} \rightarrow \delta = 0\)
  • \(\mathcal{H}_{1}: \mu_{1} \neq \mu_{2} \rightarrow \delta \neq 0\)

\[\underbrace{\dfrac{p(\mathcal{H}_{0}|D)}{p(\mathcal{H}_{1}|D)}}_\text{posterior odds} = \underbrace{\dfrac{p(D|\mathcal{H}_{0})}{p(D|\mathcal{H}_{1})}}_\text{Bayes factor} \times \underbrace{\dfrac{p(\mathcal{H}_{0})}{p(\mathcal{H}_{1})}}_\text{prior odds}\]

\[\text{evidence}\ = p(D | \mathcal{H}) = \int p(\theta | \mathcal{H}) p(D | \theta, \mathcal{H}) \text{d}\theta\]

The evidence in favour of a model corresponds to the marginal likelihood of a model. In other words, it is an averaged likelihood weighted by the prior predictions of the model, which makes the Bayes factor a kind of Bayesian likelihood ratio.

Bayes factors are the new p-values…

Be careful not to interpret Bayes factors as posterior odds… Bayes factors indicate how much we should update our prior odds, in the light of new incoming data. They do not tell us what is the most probable hypothesis, given the data (unless the prior odds are 1:1).

Let’s take another example:

  • \(\mathcal{H}_{0}\): there is no such thing as precognition
  • \(\mathcal{H}_{1}\): precognition does really exist

We run an experiment and observe a \(\text{BF}_{10} = 27\). What are the posterior odds in favour of \(\mathcal{H}_{1}\)?

\[\underbrace{\dfrac{p(\mathcal{H}_{1}|D)}{p(\mathcal{H}_{0}|D)}}_\text{posterior odds} = \underbrace{\dfrac{27}{1}}_\text{Bayes factor} \times \underbrace{\dfrac{1}{1000}}_\text{prior odds} = \dfrac{27}{1000} = 0.027\]

Problems induced by the mindless use of statistics

Corroborating impossible claims

In 2011, the prestigious Journal of Personality and Social Psychology published this paper written by Daryl Bem, reporting the results of nine experiments, showing evidence in favour of the existence of precognition in humans.

Questionable research practises

Undisclosed flexibility in data collection, analysis, and interpretation dramatically increases the false positive rates (see also the “garden of forking paths” from Gelman & Loken, 2013).

Scientific cycle and questionable research practises

P-hacking

Pressure to produce (e.g., publish), together with widespread misunderstanding of basic concepts in (philosophy of) statistics, have practical/dramatic consequences on the published literature (Figure from Data colada).

Replicability (and reproducibility) issues

This, amongst other things, prompted an era of (healthy) methodological scepticism, which resulted in a through re-evaluation of classical findings from Psychology and experimental sciences overall. The average replication rate in Psychology was estimated to be around 30%.

Registered reports: Reducing publication bias

Publication bias is defined as the selective publication of findings based on the obtained p-value (or another statistical index). How can we avoid this biased reporting? A simple idea is not to base the accept/reject decision on the statistical significance of the results, but rather on the theoretical relevance and methodological rigour of the study (figure from https://www.cos.io/initiatives/registered-reports).


Registered reports: Reducing publication bias

The ATOM guidelines

Do not say “statistically significant”

In 2019, The American Statistician published a special issue on “Moving to a World Beyond p<.05”, with the intention to provide new recommendations for users of statistics (e.g., researchers, policy makers, journalists). This issue comprises 43 original papers aiming to provide new guidelines and practical alternatives to the “mindless” use of statistics. In the accompanying editorial, Wasserstein et al. (2019) provide a first practical recommendation.

We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different”, “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.

ATOM guidelines

Then, they summarise their practical recommendations in the form of the ATOM guidelines:

  • Accept uncertainty: we must “countenance uncertainty in all statistical conclusions, seeking ways to quantify, visualize, and interpret the potential for error” (Calin-Jageman & Cumming, 2019).
  • Be Thoughtful: we clearly distinguish between confirmatory (preregistered) and exploratory (non-preregistered) statistical analyses. We routinely evaluate the “validity” of the statistical model and we are suspicious of statistical defaults.
  • Be Open: we try to be exhaustive in the way we report our analyses and we beware of short-cuts than could hinder important information to the reader.
  • Be Modest: we recognise that there is no unique “true statistical model” and we discuss the limitations of our analyses and conclusions. We also recognise that scientific inference is much broader than statistical inference and we try not to conclude anything from a single study without the warranted uncertainty.

How to move forward: A model comparison (and model criticism) approach

Common statistical tests are model comparisons

First insight: Common statistical “tests” (e.g., t-test, ANOVA) can be restated as comparisons of regression models.

Instead of (or in supplement to) binary conclusions, we also consider how sounds are the models we are comparing, given the phenomenon at hand.

Model comparison and out-of-sample predictive accuracy

Second insight: Instead of comparing unrealistic models (e.g., the “null hypothesis” and the unspecified/default “alternative hypothesis” models), let’s compare interesting models, embodying theoretical hypotheses of interest.

General steps of the model selection approach usually consist in establishing a set of \(R\) relevant models, ranking these models (and attributing them weights) using an information criterion, and choosing the best model from the model set to make an inference from this best model. Alternatively, one can make inference from a weighted average of the models’ predictions (aka model averaging or multimodel inference).

Model comparison and out-of-sample predictive accuracy

Hirotugu Akaike noticed that the negative log-likelihood of a model + 2 times its number of parameters was approximately equal to the out-of-sample deviance of a model…

\[\text{AIC} = \underbrace{-2\log(\mathcal{L}(\hat{\theta}|\text{data}))}_\text{in-sample deviance} + 2K\]

In-sample deviance: how bad is a model to explain the current dataset (the dataset that we used to fit the model).

Out-of-sample deviance: how bad is a model to explain a future dataset issued from the same data generating process (the same population).

Philosophical oecumenism: Statistical toolbox

Different statistical tools rest on different philosophical frameworks and aim to answer different questions.

Quantifying the relative evidence for a hypothesis/model

↳ Use Bayes factors or likelihood ratios (do not use p-values for this).

Making decisions while controlling error rates in the long-run

↳ Use NHST & p-values (à la Neyman-Pearson) (do not use Bayes factors for this).

Comparing the (out-of-sample) predictive abilities of models

↳ Use information criteria (e.g., AIC, WAIC).

Towards a principled workflow: Statistical rethinking

Making use of the toolbox and pushing further the statistical modelling and model comparison approach. The focus is on building models, validating them (both against prior knowledge and new observations), comparing them, and using them for prediction and/or inference.

A full course on Bayesian statistical (thinking and) modelling is available freely on Youtube, see the Github repository for more details: https://github.com/rmcelreath/stat_rethinking_2022.

Applying this to empirical data in cognitive sciences

Applying this to empirical data in cognitive sciences

A brief summary

Sensible data analysis is not a linear process. A typical workflow involves the following steps:

  • Think hard about a (or several) plausible data-generating process(es).
  • Think hard about the available prior knowledge and encode it into your model(s).
  • Validate these assumptions using simulation (e.g., prior predictive checking).
  • If deemed appropriate, fit the model(s) and update prior knowledge using the Bayesian machinery.
  • Assess the validity of the model(s) using simulation (e.g., posterior predictive checking).
  • Compare various interesting and competing models.
  • Make inference about interesting quantities (using as many statistical indexes as needed).
  • This is not a linear process, feedback loops are often needed between these steps.

Further resources

The special issue on “Statistical Inference in the 21st Century: A World Beyond p < 0.05”: https://www.tandfonline.com/toc/utas20/73/sup1.

Everything is fucked: The syllabus, https://hardsci.wordpress.com/2016/08/11/everything-is-fucked-the-syllabus/.

Some examples of ATOMised reporting of statistical modelling (from my own work): https://pubs.asha.org/doi/abs/10.1044/2018_JSLHR-S-18-0006, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0233282, https://journals.sagepub.com/doi/abs/10.1177/0956797619900336.

Introduction to the Meehlian Corroboration-Verisimilitude theory of science: https://www.barelysignificant.com/post/corroboration1/ and https://www.barelysignificant.com/post/corroboration2/.

The materials of my doctoral course on Bayesian statistical modelling (in French): https://www.barelysignificant.com/IMSB2022/.

Take-home messages


Don’ts

  • Do not say “statistically significant” (but you can use p-values to control error rates).
  • Do not dichotomise or trichotomise statistical results.

Dos

  • Read, digest, and teach some philosophy of statistics and statistical modelling (vs. testing).
  • Accept uncertainty. Be thoughtful, open, and modest.

  lnalborczyk   lnalborczyk   https://osf.io/ba8xt   www.barelysignificant.com

References

Calin-Jageman, R. J., & Cumming, G. (2019). The New Statistics for Better Science: Ask How Much, How Uncertain, and What Else Is Known. The American Statistician, 73(sup1), 271–280. https://doi.org/10.1080/00031305.2018.1518266
Campbell, D. T. (1990). The meehlian corroboration-verisimilitude theory of science. Psychological Inquiry, 1(2), 142–147. https://www.jstor.org/stable/1448769
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no fishing expedition or p-hacking and the research hypothesis was posited ahead of time∗. 17.
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning (pp. 311–339). Lawrence Erlbaum Associates, Inc.
Lakatos, I. (1976). Falsification and the methodology of scientific research programmes (S. G. Harding, Ed.; pp. 205–259). Springer Netherlands. https://doi.org/10.1007/978-94-010-1863-0_14
McElreath, R. (2016). Statistical rethinking: A bayesian course with examples in r and stan. CRC Press/Taylor & Francis Group.
McElreath, R. (2020). Statistical rethinking: A bayesian course with examples in r and stan (2nd ed.). Taylor; Francis, CRC Press.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141. https://doi.org/10.1207/s15327965pli0102_1
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir karl, sir ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834. https://doi.org/10.1037/0022-006X.46.4.806
Meehl, P. E. (1986). What Social Scientists Dont Understand (D. W. Fiske & R. A. Shweder, Eds.; p. 24). Chicago: University of Chicago Press.
Meehl, P. E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. What If There Were No Significance Tests?, 393–425. http://citeseerx.ist.psu.edu/viewdoc/citations;jsessionid=72FF987997EFB5F0602B02E1A2E04E40?doi=10.1.1.693.9583
Pollard, P., & Richardson, J. T. (1987). On the probability of making type i errors. Psychological Bulletin, 102(1), 159–163. https://doi.org/10.1037/0033-2909.102.1.159
Rouder, J. N., Morey, R. D., & Wagenmakers, E.-J. (2016). The Interplay between Subjectivity, Statistical Practice, and Psychological Science. Collabra, 2(1), 6. https://doi.org/10.1525/collabra.28
Trafimow, D., & Earp, B. D. (2017). Null hypothesis significance testing and Type I error: The domain problem. New Ideas in Psychology, 45, 19–27. https://doi.org/10.1016/j.newideapsych.2017.01.002
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems ofp values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/bf03194105
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond p < 0.05. The American Statistician, 73(sup1), 1–19. https://doi.org/10.1080/00031305.2019.1583913